17 research outputs found
Emergency Behavior for a Connected Smartphone
Smartphones today are capable of being connected to one or more devices via wireless communication technologies, such as Bluetooth®. While a smartphone is connected (or “paired”) with a device, an audio-distribution profile of the smartphone governs routing of audio to and from mechanisms (e.g., speakers, microphones) of either the smartphone or a device to which the smartphone is paired. A protocol that routes audio, based on an input telephone number being an emergency telephone number, is described.
Dialogue Act Modeling for Automatic Tagging and Recognition of Conversational Speech
We describe a statistical approach for modeling dialogue acts in
conversational speech, i.e., speech-act-like units such as Statement, Question,
Backchannel, Agreement, Disagreement, and Apology. Our model detects and
predicts dialogue acts based on lexical, collocational, and prosodic cues, as
well as on the discourse coherence of the dialogue act sequence. The dialogue
model is based on treating the discourse structure of a conversation as a
hidden Markov model and the individual dialogue acts as observations emanating
from the model states. Constraints on the likely sequence of dialogue acts are
modeled via a dialogue act n-gram. The statistical dialogue grammar is combined
with word n-grams, decision trees, and neural networks modeling the
idiosyncratic lexical and prosodic manifestations of each dialogue act. We
develop a probabilistic integration of speech recognition with dialogue
modeling, to improve both speech recognition and dialogue act classification
accuracy. Models are trained and evaluated using a large hand-labeled database
of 1,155 conversations from the Switchboard corpus of spontaneous
human-to-human telephone speech. We achieved good dialogue act labeling
accuracy (65% based on errorful, automatically recognized words and prosody,
and 71% based on word transcripts, compared to a chance baseline accuracy of
35% and human accuracy of 84%) and a small reduction in word recognition error.
Comment: 35 pages, 5 figures. Changes in copy editing (note title spelling changed
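The hidden Markov model formulation in the abstract above can be illustrated with a minimal Viterbi decoder. This is only a sketch under stated assumptions, not the authors' implementation: the three dialogue acts, every probability value, and the function name `viterbi` are hypothetical, a bigram stands in for the dialogue act n-gram, and the per-utterance likelihoods stand in for whatever the word and prosody models would produce.

```python
import math

def viterbi(likelihoods, transitions, initial):
    """Return the most probable dialogue-act sequence under a bigram HMM.

    likelihoods: one dict per utterance, act -> P(evidence | act)
    transitions: prev_act -> {act: P(act | prev_act)}  (dialogue act bigram)
    initial:     act -> P(act) for the first utterance
    """
    acts = list(initial)
    # Log-probability of the best path ending in each act at the first utterance.
    best = {a: math.log(initial[a]) + math.log(likelihoods[0][a]) for a in acts}
    backpointers = []
    for obs in likelihoods[1:]:
        new_best, pointers = {}, {}
        for a in acts:
            # Best predecessor for act `a` at this utterance.
            prev = max(acts, key=lambda p: best[p] + math.log(transitions[p][a]))
            new_best[a] = best[prev] + math.log(transitions[prev][a]) + math.log(obs[a])
            pointers[a] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final act.
    path = [max(acts, key=lambda a: best[a])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical toy numbers, chosen only for illustration.
initial = {"Statement": 0.6, "Question": 0.3, "Backchannel": 0.1}
transitions = {
    "Statement": {"Statement": 0.5, "Question": 0.2, "Backchannel": 0.3},
    "Question": {"Statement": 0.7, "Question": 0.1, "Backchannel": 0.2},
    "Backchannel": {"Statement": 0.6, "Question": 0.3, "Backchannel": 0.1},
}
likelihoods = [
    {"Statement": 0.2, "Question": 0.7, "Backchannel": 0.1},
    {"Statement": 0.6, "Question": 0.1, "Backchannel": 0.3},
    {"Statement": 0.1, "Question": 0.1, "Backchannel": 0.8},
]
decoded = viterbi(likelihoods, transitions, initial)
print(decoded)  # ['Question', 'Statement', 'Backchannel']
```

Working in log space avoids numerical underflow on long conversations. The transition table is where the discourse constraints enter: the bigram can tip the decision between labels whenever the per-utterance evidence alone is ambiguous.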
Can Prosody Aid the Automatic Classification of Dialog Acts in Conversational Speech?
Identifying whether an utterance is a statement, question, greeting, and so forth is integral to effective automatic understanding of natural dialog. Little is known, however, about how such dialog acts (DAs) can be automatically classified in truly natural conversation. This study asks whether current approaches, which use mainly word information, could be improved by adding prosodic information. The study examines over 1000 conversations from the Switchboard corpus. DAs were hand-annotated, and prosodic features (duration, pause, F0, energy and speaking-rate features) were automatically extracted for each DA. In training, decision trees based on these features were inferred; trees were then applied to unseen test data to evaluate performance. For an all-way classification as well as three subtasks, prosody allowed highly significant classification over chance. Feature-specific analyses further revealed that although canonical features (such as F0 for questions) were important, less obvious features could compensate if canonical features were removed. Finally, in each task, integrating the prosodic model with a DA-specific statistical language model improved performance over that of the language model alone. Results suggest that DAs are redundantly marked in natural conversation, and that a variety of automatically extractable prosodic features could aid dialog processing in speech applications.
Automatic detection of discourse structure for speech recognition and understanding
We describe a new approach for statistical modeling and detection of discourse structure
for natural conversational speech. Our model is based on 42 ‘Dialog Acts’ (DAs)
such as question, answer, backchannel, agreement, disagreement, and apology. We labeled
1155 conversations from the Switchboard (SWBD) database (Godfrey et al. 1992) of
human-to-human telephone conversations with these 42 types and trained a Dialog Act
detector based on three distinct knowledge sources: sequences of words which characterize
a dialog act, prosodic features which characterize a dialog act, and a statistical
Discourse Grammar. Our combined detector, although still in preliminary stages, already
achieves a 65% Dialog Act detection rate based on acoustic waveforms, and 72%
accuracy based on word transcripts. Using this detector to switch among the 42
Dialog-Act-Specific trigram LMs also gave us an encouraging but not statistically significant
reduction in SWBD word error.
Towards Better Integration Of Semantic Predictors In Statistical Language Modeling
We introduce a number of techniques designed to help integrate semantic knowledge with N-gram language models for automatic speech recognition. Our techniques allow us to integrate Latent Semantic Analysis (LSA), a word-similarity algorithm based on word co-occurrence information, with N-gram models. While LSA is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with N-grams. We show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with N-grams produces a more robust language model which has a lower perplexity on a Wall Street Journal test set than a baseline N-gram model.
1. INTRODUCTION There has been a lot of recent work on augmenting n-gram language models with other information sources such as longer-distance syntactic and semantic constraints (e.g. [8], [6]). In previous ..
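The difference between the linear and geometric combinations mentioned in the abstract above can be sketched as follows. This is an illustrative sketch only, not the paper's formulation: the function names, the three-word vocabulary, the probability values, and the single fixed weight `lam` are all assumptions (the abstract describes a per-word confidence metric, which is not modeled here).

```python
def linear_mix(p_ngram, p_lsa, lam):
    """Linear interpolation: a weighted arithmetic mean of the two distributions."""
    return {w: (1 - lam) * p_ngram[w] + lam * p_lsa[w] for w in p_ngram}

def geometric_mix(p_ngram, p_lsa, lam):
    """Geometric interpolation: a weighted product, renormalized over the vocabulary."""
    unnorm = {w: p_ngram[w] ** (1 - lam) * p_lsa[w] ** lam for w in p_ngram}
    total = sum(unnorm.values())
    return {w: v / total for w, v in unnorm.items()}

# Hypothetical next-word distributions over a toy vocabulary.
p_ngram = {"the": 0.6, "stock": 0.3, "banana": 0.1}
p_lsa = {"the": 0.3, "stock": 0.6, "banana": 0.1}

lin = linear_mix(p_ngram, p_lsa, lam=0.5)
geo = geometric_mix(p_ngram, p_lsa, lam=0.5)
print(lin)
print(geo)
```

Unlike the linear mix, the geometric mix has to be renormalized, because a weighted product of two distributions does not in general sum to one. A word that either model scores near zero also stays near zero in the geometric mix, whereas the linear mix averages the two scores.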
Automatic Detection of Discourse Structure for Speech Recognition and Understanding
We describe a new approach for statistical modeling and detection of discourse structure for natural conversational speech. Our model is based on 42 ‘Utterance Types’ (UTs) such as question, answer, backchannel, agreement, disagreement, and apology. We labeled 1155 conversations from the Switchboard (SWBD) database (Godfrey et al. 1992) of human-to-human telephone conversations with these 42 types and trained an Utterance Type detector based on three distinct knowledge sources: sequences of words which characterize an utterance type, prosodic features which characterize an utterance type, and a statistical Discourse Grammar. Our combined detector, although still in preliminary stages, already achieves a 65% Utterance Type detection rate. Using this detector to switch among the 42 Utterance-Type-Specific trigram LMs also gave us an encouraging but not statistically significant reduction in SWBD word error.
1. INTRODUCTION The ability to model and automatically detect discourse structure i..